feat(quantization): Add GPTQ n-bit quantization support #21551
base: master
Conversation
Summary of Changes
Hello @amitsrivastava78, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
I've implemented support for GPTQ (Generative Pre-trained Transformer Quantization) n-bit quantization within Keras. This feature allows for significantly reducing the memory footprint and improving the inference speed of large language models by quantizing their weights to lower precision, such as 4-bit. The integration provides a streamlined way to apply GPTQ to Keras models, enabling more efficient deployment on resource-constrained hardware while aiming to maintain model performance.
Highlights
- New GPTQ Quantization Mode: I've introduced a new 'gptq' quantization mode, expanding the existing `QUANTIZATION_MODES` to support this advanced n-bit quantization technique.
- Extended `model.quantize()` Method: The `model.quantize()` method has been updated to recognize and process the 'gptq' mode. This now requires a dedicated `GPTQConfig` object, which encapsulates all necessary parameters for the GPTQ algorithm, ensuring proper configuration and execution (a usage sketch follows this list).
- New Quantization Modules: I've added a new `quantizers` directory containing the core logic for GPTQ. This includes the `GPTQ` class for layer-specific operations, `GPTQConfig` for overall parameter management, `gptqutils` for data loading and layer-wise application, and `quant` for fundamental quantization functions.
- Enhanced Testing for GPTQ: Comprehensive unit tests have been added and updated in `model_test.py` to validate the GPTQ implementation. These tests cover various scenarios, including different dataset types (in-memory, generator, and public datasets like WikiText2) and ensure the quantized models retain functionality.
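For reviewers' reference, here is a minimal usage sketch of the API described above. It is illustrative only and not runnable as-is: `model` and `my_tokenizer` stand in for a real Keras LLM and its tokenizer, and the `GPTQConfig` arguments shown (dataset, tokenizer, wbits, group_size, symmetric, act_order) are taken from the diff excerpts quoted later in this thread, so names may shift as the review progresses.

```python
from keras.quantizers import GPTQConfig

# Hypothetical calibration samples; GPTQ uses a small dataset to estimate
# activation statistics before quantizing weights layer by layer.
calibration_texts = [
    "GPTQ quantizes weights using second-order activation information.",
    "A handful of representative samples is usually enough for calibration.",
]

config = GPTQConfig(
    dataset=calibration_texts,  # in-memory samples, a generator, or "wikitext2"
    tokenizer=my_tokenizer,     # placeholder: a callable that tokenizes strings
    wbits=4,                    # target weight precision in bits
    group_size=128,
    symmetric=False,
    act_order=False,
)

# `model` is a placeholder for an existing Keras LLM (e.g. a GPT-2 backbone).
model.quantize("gptq", config=config)
```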
Code Review
This pull request introduces GPTQ n-bit quantization support to Keras, a significant feature. The implementation is well-structured, separating configuration, core logic, and utilities. The changes include a new `GPTQConfig`, integration into `model.quantize()`, the GPTQ algorithm implementation, and corresponding tests. My review has identified a few areas for improvement: removing leftover debug code, fixing an inconsistent error message, using HTTPS for downloads, addressing some dead code in tests, and refactoring for maintainability by reducing code duplication. Additionally, there's a performance consideration regarding tensor concatenation within a loop that could be optimized.
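On the tensor-concatenation point: the usual remedy is to accumulate results in a Python list and concatenate once after the loop, rather than growing a tensor on every iteration. A generic sketch of that pattern (not the PR's actual code; the helper name is made up):

```python
from keras import ops

def stack_calibration_batches(batches):
    """Collect batches in a Python list and concatenate once at the end,
    instead of concatenating inside the loop (which re-copies the growing
    tensor on every iteration)."""
    collected = []
    for batch in batches:
        collected.append(ops.convert_to_tensor(batch))
    return ops.concatenate(collected, axis=0)

# e.g. stack_calibration_batches([[1.0, 2.0], [3.0, 4.0]]) -> tensor of shape (4,)
```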
Codecov Report
❌ Patch coverage: 62.40409% of the new code is missing test coverage. Additional details and impacted files:

@@            Coverage Diff             @@
##           master   #21551      +/-   ##
==========================================
+ Coverage   82.75%   82.79%   +0.04%
==========================================
  Files         567      571       +4
  Lines       56471    56807     +336
  Branches     8818     8883      +65
==========================================
+ Hits        46730    47033     +303
- Misses       7580     7597      +17
- Partials     2161     2177      +16

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Thanks for this PR! I have left a few initial comments.
It would be helpful to attach colabs to the PR description showing improvements over raw 4-bit quantization for models that this feature has been tested with.
Thanks for the PR! There is a lot going on.
This is just a first pass, mostly high level / API comments.
Colab attached in the PR now.
Thank you for this PR! I left a few comments.
The code here is lacking test coverage. The codecov report also suggests the same: 62.40409% of the code is missing test coverage.
One of the expectations is that after quantization, it should be possible to
- run the model (using the quantized kernels)
- save it (keeping it quantized)
- reload it (keeping it quantized)
- run it again (quantized).
And also
- export it (which should trace it with the quantized kernels)
I don't see the hooks needed in `Dense.quantized_call` and `EinsumDense.quantized_call`, nor the variables to support that.
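To make that concrete, a rough end-to-end check along the lines of these expectations might look like the sketch below; `model`, `gptq_config`, the input shape, and the file paths are illustrative placeholders, not the PR's actual test code.

```python
import numpy as np
import keras

x = np.random.rand(2, 16).astype("float32")  # placeholder input

model.quantize("gptq", config=gptq_config)   # quantize in place
y_quantized = model.predict(x)               # 1) run with the quantized kernels

model.save("gptq_model.keras")               # 2) save, keeping it quantized
reloaded = keras.saving.load_model("gptq_model.keras")  # 3) reload, still quantized
y_reloaded = reloaded.predict(x)             # 4) run again (quantized)

np.testing.assert_allclose(y_quantized, y_reloaded, rtol=1e-5)

reloaded.export("gptq_saved_model")          # 5) export, tracing the quantized kernels
```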
About the overall design:
- `GPTQConfig` is the global config.
- `GPTQQuant` is most importantly the "state" (quantized kernel), although it also has config and some logic to determine this state. Note that the logic to dequantize is separate.
- `GPTQ` is the wrapper that connects together one layer with one `GPTQQuant`, so there are many `GPTQ` instances, and `GPTQ` is not where the core loop lives.
Let's find some names (or maybe even a different structure) that make it easier to follow this. I also find the splitting in 3 files (`gptq.py`, `gptqquant.py`, `gptqutils.py`) a little hard to navigate.
self.group_size = group_size
self.symmetric = symmetric
self.act_order = act_order
self.quantization_method = "gptq"
What is this for?
This holds the information about the group size, whether symmetric quantization is used, and whether act_order is needed; it will be used while doing the GPTQ quantization.
Oh sorry, GitHub makes this confusing because it always shows 4 lines of context.
I meant, what is `self.quantization_method = "gptq"` for? It's never accessed and is implied by the fact that this is `GPTQConfig`.
keras/src/quantizers/gptqquant.py
Outdated
return ops.multiply(scale, dequantized_x)

class GPTQQuant:
What does `Quant` stand for? Quantizer? Quantized (kernel)? Quantization?
"Quant" stands for Quantization
Can you call it `GPTQQuantization`? Let's limit abbreviations when there is ambiguity.
colab link shared with jyoptinder
This commit integrates the GPTQ (Generative Pre-trained Transformer Quantization) algorithm into Keras. Key features include:
- A new `GPTQConfig` for configuring quantization parameters.
- Integration with base Keras models via a `model.quantize()` method.
- Support for multiple datasets (WIKITEXT2, PTB, C4, custom datasets) and tested models (GPT-2, OPT, Bloom, gemma3, etc.).
- Unit tests to verify perplexity and model functionality post-quantization.
I didn't see a response to:
One of the expectations is that after quantization, it should be possible to
- run the model (using the quantized kernels)
- save it (keeping it quantized)
- reload it (keeping it quantized)
- run it again (quantized).
And also
- export it (which should trace it with the quantized kernels)
I don't see the hooks needed in `Dense.quantized_call` and `EinsumDense.quantized_call`, nor the variables to support that.
else:
    # Test for valid cases where no error should occur
    try:
        model.quantize(mode, config=config)
Per my PR-level comment, can you add `model.save`, then reload and verify it's still quantized.
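A minimal shape such a test could take, sketched under the assumption that the surrounding test already builds `model` and sample inputs `x`; the helper name, path, and tolerances below are illustrative.

```python
import os
import numpy as np
import keras

def assert_quantized_roundtrip(model, x, tmp_dir):
    """Hypothetical helper: after a save/load round trip, the reloaded model
    should produce the same outputs as the freshly quantized one."""
    y_before = model.predict(x)
    path = os.path.join(tmp_dir, "quantized_model.keras")
    model.save(path)
    reloaded = keras.saving.load_model(path)
    y_after = reloaded.predict(x)
    np.testing.assert_allclose(y_before, y_after, rtol=1e-5, atol=1e-6)
```

This only checks output parity; asserting that the reloaded variables are still in their quantized representation would depend on how the GPTQ state ends up being stored.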
from keras.src.quantizers.gptq_core import quantize_model

@keras_export(["keras.GPTQConfig", "keras.quantizers.GPTQConfig"])
I don't think we should export as `keras.GPTQConfig`; very few things should be at the top level. Any reason to do that? `keras.quantizers.GPTQConfig` alone works.
Thanks for the PR!
on their activation's second-order information.
"""

W = ops.transpose(ops.cast(self.layer.kernel, "float32"))
Throughout the code, there's a lot of use of single-letter capital variables, which is against our code style. Variables should be lowercase, with underscores if needed, and they should use reasonably descriptive names rather than single letters.
    analyze the model's activations.
tokenizer: A `keras_nlp.Tokenizer` instance (or a similar callable)
    that is used to process the `dataset` if it contains strings.
wbits (int, optional): The number of bits to quantize weights to.
Argument names don't follow code style. They should:
- Use underscores, e.g. `num_samples`
- Use "num" for number prefixes
- Avoid abbreviations unless the word being abbreviated is extremely obvious.
For instance, `percdamp` should probably be `hessian_damping`.
@keras_export(["keras.GPTQConfig", "keras.quantizers.GPTQConfig"])
class GPTQConfig:
    """Configuration class for the GPTQ algorithm.
This docstring should feature code examples.
class GPTQConfig:
    """Configuration class for the GPTQ algorithm.

    This class holds all the parameters needed to apply the GPTQ method
We should explain what the GPTQ method is and why/when a user should use it.